Spatial Model fitting in GLS

In this exercise we will fit a linear model using a Spatial structure as covariance matrix. We will use GLS to get better estimators.

As always we will need to load the necessary libraries.


In [1]:
# Load Biospytial modules and etc.
%matplotlib inline
import sys
sys.path.append('/apps')
sys.path.append('..')
sys.path.append('../spystats')
import django
django.setup()
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
## Use the ggplot style
plt.style.use('ggplot')

import tools

Use this to automate the process. Be carefull it can overwrite current results

run ../HEC_runs/fit_fia_logbiomass_logspp_GLS.py /RawDataCSV/idiv_share/plotsClimateData_11092017.csv /apps/external_plugins/spystats/HEC_runs/results/logbiomas_logsppn_res.csv -85 -80 30 35

Importing data

We will use the FIA dataset and for exemplary purposes we will take a subsample of this data. Also important. The empirical variogram has been calculated for the entire data set using the residuals of an OLS model.

We will use some auxiliary functions defined in the fit_fia_logbiomass_logspp_GLS. You can inspect the functions using the ?? symbol.


In [ ]:


In [2]:
from HEC_runs.fit_fia_logbiomass_logspp_GLS import prepareDataFrame,loadVariogramFromData,buildSpatialStructure, calculateGLS, initAnalysis, fitGLSRobust

In [3]:
section = initAnalysis("/RawDataCSV/idiv_share/FIA_Plots_Biomass_11092017.csv",
                        "/apps/external_plugins/spystats/HEC_runs/results/variogram/data_envelope.csv",
                       -130,-60,30,40)


INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Reprojecting to Alberts equal area
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Removing possible duplicates. 
 This avoids problems of Non Positive semidefinite
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Fitting OLS linear model: logBiomass ~ logSppN 
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Subselecting Region

In [4]:
#section = initAnalysis("/RawDataCSV/idiv_share/plotsClimateData_11092017.csv",
#                        "/apps/external_plugins/spystats/HEC_runs/results/variogram/data_envelope.csv",
#                       -85,-80,30,35)

# IN HEC
#section = initAnalysis("/home/hpc/28/escamill/csv_data/idiv/FIA_Plots_Biomass_11092017.csv","/home/hpc/28/escamill/spystats/HEC_runs/results/variogram/data_envelope.csv",-85,-80,30,35)

In [5]:
section.shape


Out[5]:
(18414, 22)

Now we will obtain the data from the calculated empirical variogram.


In [6]:
gvg,tt = loadVariogramFromData("/apps/external_plugins/spystats/HEC_runs/results/variogram/data_envelope.csv",section)


INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Reading the empirical Variogram file
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Instantiating a Variogram object with the values calculated before
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Dropping possible Nans
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Instantiating Model...
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:fitting whittle Model with the empirical variogram
../spystats/tools.py:625: RuntimeWarning: divide by zero encountered in power
  g_h = ((sill - nugget)*(1 - np.exp(-(h**alpha / range_a**alpha)))) + nugget*Ih
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Model fitted

In [7]:
gvg.plot(refresh=False,with_envelope=True)



In [8]:
resum,gvgn,resultspd,results = fitGLSRobust(section,gvg,num_iterations=1,distance_threshold=1000000)


INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Building Spatial Covariance Matrix
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Calculating Distance Matrix
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Calculating GLS estimators
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Fitting linear model using GLS
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:RESULTS::: n_obs: 18414.0, r-squared: 0.898310288151, {{"Intercept":8.4466941208,"logSppN":0.3949644501},{"Intercept":0.0,"logSppN":0.0},{"0":{"Intercept":8.4204865441,"logSppN":0.380414251},"1":{"Intercept":8.4729016975,"logSppN":0.4095146491}}}
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Recalculating variogram
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Refiting Theoretical Variogram
../spystats/tools.py:625: RuntimeWarning: invalid value encountered in double_scalars
  g_h = ((sill - nugget)*(1 - np.exp(-(h**alpha / range_a**alpha)))) + nugget*Ih
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Variogram parameters: range 114072.405376, sill 0.329197277814, nugget 0.313620600222

In [9]:
resum.as_text


Out[9]:
<bound method Summary.as_text of <class 'statsmodels.iolib.summary.Summary'>
"""
                            GLS Regression Results                            
==============================================================================
Dep. Variable:             logBiomass   R-squared:                       0.898
Model:                            GLS   Adj. R-squared:                  0.898
Method:                 Least Squares   F-statistic:                 1.626e+05
Date:                Mon, 05 Mar 2018   Prob (F-statistic):               0.00
Time:                        16:21:05   Log-Likelihood:                -16607.
No. Observations:               18414   AIC:                         3.322e+04
Df Residuals:                   18412   BIC:                         3.323e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      8.4467      0.013    631.737      0.000         8.420     8.473
logSppN        0.3950      0.007     53.207      0.000         0.380     0.410
==============================================================================
Omnibus:                     1129.303   Durbin-Watson:                   1.952
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1744.488
Skew:                          -0.513   Prob(JB):                         0.00
Kurtosis:                       4.105   Cond. No.                         3.90
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
""">

restricted w/ all data spatial correlation parameters Log-Likelihood: -16607 AIC: 3.322e+04

restricted w/ restricted spatial correlation parameters Log-Likelihood: -16502. AIC: 3.301e+04


In [ ]:
plt.plot(resultspd.rsq)
plt.title("GLS feedback algorithm")
plt.xlabel("Number of iterations")
plt.ylabel("R-sq fitness estimator")

In [ ]:
resultspd.columns

In [ ]:
a = map(lambda x : x.to_dict(), resultspd['params'])

In [ ]:
paramsd = pd.DataFrame(a)

In [ ]:
paramsd

In [ ]:
plt.plot(paramsd.Intercept.loc[1:])
plt.get_yaxis().get_major_formatter().set_useOffset(False)

In [ ]:
fig = plt.figure(figsize=(10,10))
plt.plot(paramsd.logSppN.iloc[1:])

In [ ]:
variogram_data_path = "/apps/external_plugins/spystats/HEC_runs/results/variogram/data_envelope.csv"
thrs_dist = 100000
emp_var_log_log = pd.read_csv(variogram_data_path)

Instantiating the variogram object


In [ ]:
gvg = tools.Variogram(section,'logBiomass',using_distance_threshold=thrs_dist)
gvg.envelope = emp_var_log_log
gvg.empirical = emp_var_log_log.variogram
gvg.lags = emp_var_log_log.lags
#emp_var_log_log = emp_var_log_log.dropna()
#vdata = gvg.envelope.dropna()

Instantiating theoretical variogram model


In [ ]:
matern_model = tools.MaternVariogram(sill=0.34,range_a=100000,nugget=0.33,kappa=4)
whittle_model = tools.WhittleVariogram(sill=0.34,range_a=100000,nugget=0.0,alpha=3)
exp_model = tools.ExponentialVariogram(sill=0.34,range_a=100000,nugget=0.33)
gaussian_model = tools.GaussianVariogram(sill=0.34,range_a=100000,nugget=0.33)
spherical_model = tools.SphericalVariogram(sill=0.34,range_a=100000,nugget=0.33)

In [ ]:
gvg.model = whittle_model
#gvg.model = matern_model
#models = map(lambda model : gvg.fitVariogramModel(model),[matern_model,whittle_model,exp_model,gaussian_model,spherical_model])

gvg.fitVariogramModel(whittle_model)

In [ ]:
import numpy as np
xx = np.linspace(0,1000000,1000)

gvg.plot(refresh=False,with_envelope=True)
plt.plot(xx,whittle_model.f(xx),lw=2.0,c='k')
plt.title("Empirical Variogram with fitted Whittle Model")
FOr the sake of brevity I didn't include the steps for calculating the covariance matrix and include it in the GLS routine. These are the results after fitting. """ GLS Regression Results ============================================================================== Dep. Variable: logBiomass R-squared: 0.900 Model: GLS Adj. R-squared: 0.900 Method: Least Squares F-statistic: 3.304e+05 Date: Sun, 28 Jan 2018 Prob (F-statistic): 0.00 Time: 12:48:52 Log-Likelihood: -34662. No. Observations: 36868 AIC: 6.933e+04 Df Residuals: 36866 BIC: 6.935e+04 Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 8.4705 0.010 840.884 0.000 8.451 8.490 logSppN 0.3909 0.006 66.280 0.000 0.379 0.402 ============================================================================== Omnibus: 1933.532 Durbin-Watson: 1.906 Prob(Omnibus): 0.000 Jarque-Bera (JB): 2819.569 Skew: -0.475 Prob(JB): 0.00 Kurtosis: 3.966 Cond. No. 3.75 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. """

In [ ]:
def randomSelection(n,p):
    idxs = np.random.choice(n,p,replace=False)
    random_sample = new_data.iloc[idxs]
    return random_sample
#################
n = len(new_data)
p = 3000 # The amount of samples taken (let's do it without replacement)

In [ ]:
random_sample = randomSelection(n,100)